Quantized Stochastic Gradient Descent: Communication versus Convergence
Abstract
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks. A fundamental barrier for parallelizing large-scale SGD is the fact that the cost of communicating the gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always provably converge, and it is not clear whether they are optimal. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions. QSGD allows the user to trade off compression and convergence time: it can communicate a sublinear number of bits per iteration in the model dimension, and can achieve asymptotically optimal communication cost. We complement our theoretical results with empirical data, showing that QSGD can significantly reduce communication cost, while being competitive with standard uncompressed techniques on a variety of real tasks.
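The abstract describes compressing gradients by randomly quantizing each coordinate to a small set of discrete levels while keeping the quantized vector an unbiased estimate of the true gradient. The sketch below illustrates that idea in NumPy; the function name qsgd_quantize, the level count s, and the dense encoding are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def qsgd_quantize(v, s=16, rng=None):
    """Stochastically quantize gradient vector v to s levels per coordinate.

    A minimal sketch of randomized gradient quantization: each coordinate is
    mapped to one of s discrete levels, rounded up or down at random so that
    the quantized vector is an unbiased estimator of v.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    # Normalized magnitudes lie in [0, 1]; scale them to the s levels.
    scaled = np.abs(v) / norm * s
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part -> unbiasedness.
    prob_up = scaled - lower
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s

# Usage: the receiver only needs the norm, the signs, and the small integer
# levels, which is much cheaper to transmit than 32-bit floats per coordinate.
g = np.random.randn(1000)
g_hat = qsgd_quantize(g, s=16)
# Per-coordinate error is bounded by ||g|| / s for this scheme.
assert np.allclose(g_hat, g, atol=np.linalg.norm(g) / 16 + 1e-12)
```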
Similar resources
QSGD: Randomized Quantization for Communication-Optimal Stochastic Gradient Descent
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks. A fundamental barrier for parallelizing large-scale SGD is the fact that the cost of communicating the gradient updates between nodes can be ve...
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradient...
Preserving communication bandwidth with a gradient coding scheme
Large-scale machine learning involves the communication of gradients, and large models often saturate the communication bandwidth to communicate gradients. I implement an existing scheme, quantized stochastic gradient descent (QSGD), to reduce the communication bandwidth. This requires a distributed architecture, and we choose to implement a parameter server that uses the Message Passing Interfac...
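Since this entry is about saving communication bandwidth, a rough back-of-the-envelope comparison may help. The helper below is a hypothetical sketch that assumes a naive fixed-width packing (one float32 norm, one sign bit and a fixed number of level bits per coordinate), not the Elias-coded, sparsity-aware encoding QSGD actually uses, which is tighter.

```python
import math

def quantized_vs_raw_bytes(d, s=16):
    """Estimate per-iteration message size for a d-dimensional gradient:
    a quantized message (one float32 norm, plus one sign bit and
    ceil(log2(s + 1)) level bits per coordinate, densely packed) versus
    sending raw float32 coordinates. Returns (quantized_bytes, raw_bytes)."""
    bits_per_level = math.ceil(math.log2(s + 1))
    quantized_bits = 32 + d * (1 + bits_per_level)
    raw_bits = d * 32
    return quantized_bits / 8, raw_bits / 8

# Usage: for a million-parameter model and s = 16 levels, the quantized
# message is roughly 0.75 MB versus 4 MB of raw float32 gradients.
q_bytes, raw_bytes = quantized_vs_raw_bytes(d=1_000_000, s=16)
print(f"quantized ≈ {q_bytes / 1e6:.2f} MB, raw float32 ≈ {raw_bytes / 1e6:.2f} MB")
```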
Conditional Accelerated Lazy Stochastic Gradient Descent
In this work we introduce a conditional accelerated lazy stochastic gradient descent algorithm with an optimal number of calls to a stochastic first-order oracle and convergence rate O(1/ε²), improving over the projection-free, Online Frank-Wolfe based stochastic gradient descent of Hazan and Kale [2012] with convergence rate O(1/ε⁴).
Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization
Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled with replacement. In practice, however, sampling without replacement is very common, easier to implement in many cases, and often performs better. In this paper, we provide competitive convergence guarantees for without-replacement sampling, under various scenarios, f...
Journal:
Volume, Issue:
Pages: -
Publication date: 2016